Skip to content

ENH: Use find_stack_level in pandas.core #44358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Nov 12, 2021

Conversation

rhshadrach
Copy link
Member

Part of #44347

  • Ensure all linting tests pass, see here for how to run them

This handles most of the stacklevels in core. A few more will take a bit more effort, separating out into a separate PR. After all stacklevels are replaced, will be removing check_stacklevel=False from the tests.

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending green(ish)

@jreback jreback added this to the 1.4 milestone Nov 11, 2021
@jreback
Copy link
Contributor

jreback commented Nov 11, 2021

hmm can you rebase again. nothing should be failing except the 3.10 sdist test (or there is an issue)

@jreback jreback merged commit 01b86ed into pandas-dev:master Nov 12, 2021
@jreback
Copy link
Contributor

jreback commented Nov 12, 2021

very nice @rhshadrach

@rhshadrach rhshadrach deleted the stacklevel branch November 12, 2021 03:56
nickleus27 pushed a commit to nickleus27/pandas that referenced this pull request Nov 28, 2021
@TomAugspurger
Copy link
Contributor

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 2, 2021

Most likely not - I'll do some benchmarking and see where this is coming from. If it's really widespread, might need to revert. I should be able to check this out in the next few days.

@rhshadrach
Copy link
Member Author

@TomAugspurger - Is there a way to get the table from the full list as tabular data?

@rhshadrach
Copy link
Member Author

rhshadrach commented Dec 4, 2021

After digging into this a bit, it does seem worth it to incur the slow down in at least a lot of these benchmarks. It takes about 3ms on my machine to call find_stack_trace. In a benchmark that runs in 10ms like the first example posted above, this is regarded as significant. However, in this particular instance, find_stack_level is called only once. I think that 3ms overhead is worth it to get a better error message to the user and to lessen maintenance (moving code around, calling from multiple places, etc).

@TomAugspurger @jreback - Is there a consensus that if find_stack_level is called O(1) times (with a small constant), then this is worth the overhead?

What would be concerning is if there is code where a warning is generated multiple times, making the overhead significantly greater. Indeed this is happening in at least two cases, and it should definitely be reverted. To track this, I've run the test suite while outputting (to a file) any test that hit find_stack_trace along with the number of times it was called. If the test is parameterized, I'm also storing the run id from pytest. The value_counts for calling find_stack_level from a single line of code in a test run more than once are:

calls
2        1378
3         320
4          70
5           2
6           6
7           2
9           2
10          8
12          5
14          2
24          1
28          1
38          1
50          1
56          3
102         1
103         1
300         2
dtype: int64

There can be lots of false positives here, e.g. a for loop in the test. Digging into the top offenders, below are the tests and on the right are comments from investigating the instance:

                                                                  calls
/tests/indexes/multi/test_indexing.py:879:none                      300  # Remove find_stack_level
/tests/indexes/multi/test_indexing.py:883:none                      300  # Remove find_stack_level
/tests/io/parser/conftest.py:29:c_low                               103  # Called many times
/tests/io/parser/conftest.py:29:c_high                              102  # Called many times
/tests/frame/test_constructors.py:1156:none                          56  # Called many times
/tests/frame/test_constructors.py:1163:none                          56  # Called many times
/tests/frame/test_constructors.py:1170:none                          56  # Called many times
/tests/indexes/multi/test_indexing.py:890:none                       50  # Remove find_stack_level
/tests/plotting/common.py:658:none                                   38  # Called many times
/tests/indexes/datetimes/test_partial_slicing.py:301:none            28  # Called many times
/tests/groupby/test_function.py:975:none                             24  # Remove find_stack_level
/tests/indexes/multi/test_duplicates.py:213:none                     14  # Remove find_stack_level
/tests/io/pytables/test_put.py:314:none                              14  # Removed find_stack_level
/tests/indexes/interval/test_constructors.py:91:breaks4-int64        12  # Called many times
/tests/indexes/interval/test_constructors.py:91:breaks5-int64        12  # Called many times
/tests/indexing/common.py:161:none                                   12  # Called many times
/tests/frame/test_reductions.py:161:none                             12  # Called many times
/tests/extension/base/getitem.py:248:integer-array                   12  # Called many times

Called many times means there is e.g. a for loop in the test itself, giving rise to false positives.The two cases I mentioned above that should be reverted are related to MultiIndexing and DataFrame.describe.

@TomAugspurger
Copy link
Contributor

@TomAugspurger @jreback - Is there a consensus that if find_stack_level is called O(1) times (with a small constant), then this is worth the overhead?

Yeah I think that's a worthwhile tradeoff.

@rhshadrach
Copy link
Member Author

Thanks @TomAugspurger. I've gone through the list of regressions associated with this PR. All regressions are either (a) less than 4ms or (b) call find_stack_level once (a few of these are timed just slightly greater than 4ms) except for:

All three of these no longer call find_stack_level after doing the reverts mentioned in #44358 (comment)

@rhshadrach rhshadrach mentioned this pull request Dec 4, 2021
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants